-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
✨ Helm Chart for OpenVINO vLLM #403
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for contributing this. Seems it needs some fix to pass the CI. Please also see my comment.
Since @yongfengdu is also working on vllm related helm-chart, I'll let him comment more on this.
object: | ||
metric: | ||
# VLLM time metrics are in seconds | ||
name: vllm_ov_request_latency |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where does this metric comes from?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this resource as this metric was not setup. Anyway, this vllm-openvino resources will not be used and will be replaced by OpenVINO specific values file.
labels: | ||
{{- include "vllm-openvino.labels" . | nindent 4 }} | ||
data: | ||
MODEL_ID: {{ .Values.global.LLM_MODEL_ID | quote }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think using global.LLM_MODEL_ID is a good idea. The charts in common directory is used as components to construct the e2e AI workload, so it's possible that the same vllm-openvino chart could be used twice in the e2e helm charts(e.g. chatqna, etc.) with 2 different models. So each vllm-openvino chart should have its own LLM_MODEL_ID settings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure. I've updated this in new commits.
PORT: {{ .Values.service.port | quote }} | ||
HF_TOKEN: {{ .Values.global.HUGGINGFACEHUB_API_TOKEN | quote}} | ||
VLLM_CPU_KVCACHE_SPACE: {{ .Values.VLLM_CPU_KVCACHE_SPACE | quote }} | ||
HABANA_VISIBLE_DEVICES : {{ .Values.HABANA_VISIBLE_DEVICES | quote }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Who is using HABANA_VISIBLE_DEVICES? On K8S, it's the habana k8s-device-plugin to allocate the habana device into the container.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this variable after confirming not being used.
no_proxy: {{ .Values.global.no_proxy | quote }} | ||
HABANA_LOGS: "/tmp/habana_logs" | ||
NUMBA_CACHE_DIR: "/tmp" | ||
TRANSFORMERS_CACHE: "/tmp/transformers_cache" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't use TRANSFORMERS_CACHE
any more for tgi, but does this required for openvino. For environment in configmap, please make sure that everything you set here is actually play a role for actual application in the k8s pod
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this deprecated variable.
replicaCount: 1 | ||
|
||
image: | ||
repository: vllm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where does this image comes from? Could you please provide the link to this image on docker hub?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This image is yet not available on Dockerhub. The URL is dockerfile is added in description: vLLM OpenVINO Dockerfile.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In that case, we should defer this until the container image is available, otherwise, people can not use the helm chart directly.
TRANSFORMERS_CACHE: "/tmp/transformers_cache" | ||
HF_HOME: "/tmp/.cache/huggingface" | ||
{{- if .Values.MAX_INPUT_LENGTH }} | ||
MAX_INPUT_LENGTH: {{ .Values.MAX_INPUT_LENGTH | quote }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on openvino documentation about environment variable, I don't think it recognize this MAX_INPUT_LENGTH env
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this variable.
MAX_INPUT_LENGTH: {{ .Values.MAX_INPUT_LENGTH | quote }} | ||
{{- end }} | ||
{{- if .Values.MAX_TOTAL_TOKENS }} | ||
MAX_TOTAL_TOKENS: {{ .Values.MAX_TOTAL_TOKENS | quote }} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto as MAX_INPUT_LENGTH
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the backend components, I just integrated the vllm helm charts one or two days ago, I think it should be able to support the vllm-openvino case with just a replacement of docker image(In this case, we can provide a vllm-openvino.yaml to specify different values there). Could you check this? (https://github.com/opea-project/GenAIInfra/tree/main/helm-charts/common/vllm)
For the microservice llm-vllm, since it's the same layer with current existing llm-uservice and providing same function, I am looking to see if that's possible to use just one. Even one further step, to use just one llm docker image to support both tgi and vllm backend (Now we have 2 images: llm-tgi and llm-vllm). This may need the llm.py changes before we can merge, so I'm ok now with one more llm-vllm-uservice.
It would be great to support more cases/scenarios with less helm charts to reduce the maintenance efforts.
helm-charts/chatqna/Chart.yaml
Outdated
@@ -9,9 +9,27 @@ dependencies: | |||
- name: tgi | |||
version: 0.9.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The latest version is updated to 1.0.0, for all components.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure. updated all the versions.
|
||
Helm chart for deploying a microservice which facilitates connections and handles responses from OpenVINO vLLM microservice. | ||
|
||
`llm-vllm-uservice` depends on OpenVINO vLLM. You should properly set `vLLM_ENDPOINT` as the HOST URI of vLLM microservice. If not set, it will consider the default value : `http://<helm-release-name>-vllm-openvino:80` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember hit a limitation of 15 characters for the Chart name, maybe this is no longer a limitation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was able to work with longer names, so doesn't seem like a limitation any more.
@@ -0,0 +1,68 @@ | |||
# OpenVINO vLLM | |||
|
|||
Helm chart for deploying OpenVINO optimized vLLM Inference service. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As mentioned earlier, please try using the vllm helm charts with vllm-openvino.yaml values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a separate values file for OpenVINO.
e357dfc
to
c06f613
Compare
Signed-off-by: Krishna Murti <[email protected]>
Signed-off-by: Krishna Murti <[email protected]>
Signed-off-by: Krishna Murti <[email protected]>
Signed-off-by: Krishna Murti <[email protected]>
Signed-off-by: Krishna Murti <[email protected]>
Signed-off-by: Krishna Murti <[email protected]>
Signed-off-by: Krishna Murti <[email protected]>
4f88ece
to
815c51b
Compare
for more information, see https://pre-commit.ci
Signed-off-by: Krishna Murti <[email protected]>
Signed-off-by: Krishna Murti <[email protected]>
@krish918 Can you confirm that the vllm container image used by `vllm-openvino' chart is publicly available? Otherwise I don't think it's the right time to add a helm chart which is can NOT be installed. We should either ask the vllm upstream to release their container image, or our OPEA community to release the corresponding container image. K8S is not like docker-compose, we normally don't control the node where the vllm-openvino will be run, so we can't do the manual container image build process like the docker-compose usage scenario, unless we login into all the k8s cluster nodes and build the container image on every node. |
@krish918 Please refer to these 2 commit for my suggestions of reusing vllm and llm-uservice helm charts for openvino support. |
for more information, see https://pre-commit.ci
Signed-off-by: Krishna Murti <[email protected]>
Signed-off-by: Krishna Murti <[email protected]>
Signed-off-by: Krishna Murti <[email protected]>
for more information, see https://pre-commit.ci
Signed-off-by: Krishna Murti <[email protected]>
Signed-off-by: Krishna Murti <[email protected]>
Signed-off-by: Krishna Murti <[email protected]>
for more information, see https://pre-commit.ci
Signed-off-by: Krishna Murti <[email protected]>
Signed-off-by: Krishna Murti <[email protected]>
a74e68e
to
294f1a0
Compare
Signed-off-by: Krishna Murti <[email protected]>
Signed-off-by: Krishna Murti <[email protected]>
6598362
to
e890448
Compare
Signed-off-by: Krishna Murti <[email protected]>
@lianhao @yongfengdu @chensuyue |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refer to #473
The CI was modified to test only with ci-*values.yaml files.
By default the new vllm-openvino-values.yaml will not be tested. It's up to you to decide whether to enable vllm-openvino test or not. If you want to enable it, add a link with "ln -s vllm-openvino-values.yaml ci-vllm-openvino-values.yaml"
For llm-ctrl-uservice, We have plan to merge this with llm-uservice(And also merge tei/teirerank), but I'm OK to add this for this time.
- name: vllm | ||
version: 1.0.0 | ||
repository: file://../vllm | ||
condition: autodependency.enabled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
autodependency.enabled is no longer used, use vllm.enabled instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated.
export https_proxy=<your_https_proxy> | ||
|
||
helm dependency update | ||
helm install llm-ctrl-uservice . --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set LLM_MODEL_ID=${MODELNAME} --set vllm.LLM_MODEL_ID=${MODELNAME} --set autodependency.enabled=true --set global.http_proxy=${http_proxy} --set global.https_proxy=${https_proxy} --wait |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
vllm.enabled
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
# This is a YAML-formatted file. | ||
# Declare variables to be passed into your templates. | ||
|
||
autodependency: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
s/autodependency/vllm/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
Signed-off-by: Krishna Murti <[email protected]>
for more information, see https://pre-commit.ci
@yongfengdu - enabled CI for new values file. |
Signed-off-by: Krishna Murti <[email protected]>
for more information, see https://pre-commit.ci
Signed-off-by: Krishna Murti <[email protected]>
Signed-off-by: Krishna Murti <[email protected]>
Signed-off-by: Krishna Murti <[email protected]>
- name: llm-uservice | ||
version: 1.0.0 | ||
repository: "file://../common/llm-uservice" | ||
condition: tgi.enabled | ||
- name: llm-ctrl-uservice | ||
version: 1.0.0 | ||
repository: "file://../common/llm-ctrl-uservice" | ||
condition: vllm.enabled |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why you're adding wrappers?
They were removed over month ago for v1.1 (#474), are unnecessary, and LLM wrapper uses a langserve
component with a problematic license (opea-project/GenAIComps#264).
For LLM inference, two more microservices will be required. We can either use [TGI](https://github.com/huggingface/text-generation-inference) or [vLLM](https://github.com/vllm-project/vllm) as our LLM backend. Depending on that, we will have following microservices as part of dependencies for ChatQnA application. | ||
|
||
1. For using **TGI** as an inference service, following 2 microservices will be required: | ||
|
||
- [llm-uservice](../common/llm-uservice/README.md) | ||
- [tgi](../common/tgi/README.md) | ||
|
||
2. For using **vLLM** as an inference service, following 2 microservices would be required: | ||
|
||
- [llm-ctrl-uservice](../common/llm-ctrl-uservice/README.md) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ditto, why add wrappers?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR is from 1.0 release time, so with some old code.
I think it's better to merge with #610 , or just simple changes to support openvino after 610 get merged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good, but note that I'm testing my PR only with vLLM Gaudi version.
I.e. currently both CPU and GPU/Openvino support need to be added / tested after it.
That PR has also quite a few comment TODOs about vLLM options where some feedback would be needed / appreciated.
- [llm-ctrl-uservice](../common/llm-ctrl-uservice/README.md) | ||
- [vllm](../common/vllm/README.md) | ||
|
||
> **_NOTE :_** We shouldn't have both inference engine deployed. It is required to only setup either of them. To achieve this, conditional flags are added in the chart dependency. We will be switching off flag corresponding to one service and switching on the other, in order to have a proper setup of all ChatQnA dependencies. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why there could not be multiple inferencing engines?
ChatQnA has 4 inferencing subservices for which it is already using 2 inferencing engines, TEI and TGI.
And I do not see why it could not use e.g. TEI for embed + rerank, TGI for guardrails, and vLLM for LLM.
Please rephrase.
2. Please set `http_proxy`, `https_proxy` and `no_proxy` values while installing chart, if you are behind a proxy. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO duplicating general information to application READMEs is not maintainable, there are too many of them. Instead you could include link to general options (helm-charts/README.md
).
curl http://localhost:8888/v1/chatqna \ | ||
-X POST \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why add redundant POST
? -d
already implies that (see man curl
).
|
||
image: | ||
repository: opea/vllm-openvino | ||
pullPolicy: IfNotPresent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop the value, it breaks CI testing for latest
tag (see #587).
|
||
image: | ||
repository: opea/llm-vllm | ||
pullPolicy: IfNotPresent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop the value, it breaks CI testing for latest
tag (see #587).
openvino_enabled: true | ||
image: | ||
repository: opea/vllm-openvino | ||
pullPolicy: IfNotPresent |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop the value, it breaks CI testing for latest
tag (see #587).
# We usually recommend not to specify default resources and to leave this as a conscious | ||
# choice for the user. This also increases chances charts run on environments with little | ||
# resources, such as Minikube. If you do want to specify resources, uncomment the following | ||
# lines, adjust them as necessary, and remove the curly braces after 'resources:'. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is obsolete. Resource request should match the actual usage (+ some space for growth), but there are some complications. See discussion in #431.
helm install chatqna chatqna --set global.HUGGINGFACEHUB_API_TOKEN=${HFTOKEN} --set global.modelUseHostPath=${MODELDIR} --set tgi.LLM_MODEL_ID=${MODELNAME} | ||
``` | ||
|
||
```bash | ||
# To use Gaudi device |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now that there's support for both TGI and vLLM, all these comments here could state which one is used, e.g. like this:
# To use Gaudi device | |
# To use Gaudi device for TGI |
Description
This PR introduces OpenVINO vLLM as a new inference generation microservice. It integrates with ChatQnA application, enhancing its capabilities to utilize either of TGI or OpenVINO vLLM as underlying inference engine. The decision to use TGI or OpenVINO vLLM can be made while installing the helm chart itself. This is achieved by turning off a flag corresponding to TGI and turning on a flag corresponding to vLLM service.
Example:
In above command,
tgi.enabled
flag is turned off which is honored by chart to avoid deploying TGI microservice. At the same timevllm.enabled
flag is switched to true, which enables deployment of vLLM service.To use the OpenVINO optimized vLLM, a separate values file is present which can be used to deploy vLLM OpenVINO microservice. When using OpenVINO values file, switching flags for
tgi
orvllm
is not required as they are done inside thevllm-openvino-values.yaml
file only.Example:
Issues
No known issues yet.
Type of change
List the type of change like below. Please delete options that are not relevant.
Dependencies
Following two docker images are required as new dependencies:
Tests
Two test pods for both the new microservices are part of newly added helm charts. These pods constituting basic
curl
tests for the testing connection and proper response from both the microservices. After spinning up these services, we can run following commands with appropriate values for placeholders to test the sanity of services: